DS 7333 - Quantifying the World

Case Study #5: "Working with Missing Data"

Sterling Beason, Sean Kennedy, Emil Ramos

Introduction

In this case study, we work with missing data using the Boston house-price dataset. We explore how regression results change when missing values, introduced into one or more variables, are imputed.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
from IPython.display import Markdown as md
import seaborn as sns
In [3]:
def take_two(data,
             null_cols=[],
             test_col='AGE',
             cond='>10',
             impute_func=np.mean,
             samples=[.1, .2, .3]):
    """Null out `null_cols` in a random sample of rows matching `test_col cond`,
    then fill the holes with `impute_func` computed over the remaining rows."""
    assert callable(impute_func)
    assert isinstance(test_col, str)
    assert isinstance(cond, str)
    data_frames = {}
    for sample_rate in samples:
        df = data.copy()
        # rows meeting the condition, sampled at `sample_rate`, become "missing"
        replace_index = df.query(f'{test_col}{cond}').sample(
            frac=sample_rate, replace=False).index
        print(f'Percent of imputed values:{(len(replace_index)/len(df.index)):,.2%}')
        df.loc[replace_index, null_cols] = None
        # compute the imputation statistic from the rows that were NOT nulled
        imputed = impute_func(df.loc[df.index.difference(replace_index),
                                     null_cols],
                              axis=0)
        print(f'{imputed}\n{impute_func}\n{sample_rate}')
        for i, col in enumerate(null_cols):
            df[col].fillna(imputed[i], inplace=True)
        data_frames[sample_rate] = df
    return data_frames
In [2]:
# load data
boston = load_boston()
In [5]:
# display data descriptions
md(boston.DESCR)
Out[5]:

.. _boston_dataset:

Boston house prices dataset

Data Set Characteristics:

:Number of Instances: 506 

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

  • Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
  • Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

DataFrame Creation & Data Partitioning

In [3]:
# create independent and dependent dataframes
df = pd.DataFrame(boston.data, columns=boston.feature_names) # features dataframe
df_targets = pd.DataFrame(boston.target, columns=['target']) # targets dataframe
In [4]:
# partition data
X_train, X_test, y_train, y_test = train_test_split(df, df_targets.target, test_size=0.33, random_state=7337)

Explore Data

Feature Correlation

In [7]:
fig, ax = plt.subplots(figsize=(12,12))
im = ax.matshow(df.corr())
ax.set_xticks(np.arange(len(df.columns))) # force to show all x labels
ax.set_yticks(np.arange(len(df.columns))) # force to show all y labels
ax.set_xticklabels(df.columns)
ax.set_yticklabels(df.columns)
fig.colorbar(im)
plt.show()

Based on the correlation matrix above, we chose to interpret the following variables:

  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - % lower status of the population
  • AGE - proportion of owner-occupied units built prior to 1940

  • LSTAT has a positive correlation with CRIM: as the crime rate by town (CRIM) increases, so does the percentage of lower-status population (LSTAT) in that part of town. There is also a positive correlation between LSTAT and AGE: the percentage of lower-status population is high where the proportion of owner-occupied units built prior to 1940 is also high. Neighborhoods with older housing tend to have a lower-status population since the housing is typically cheaper, whereas newer homes tend to cater to a higher-status population that can afford them.
  • Conversely, B and AGE are inversely correlated: towns with a high proportion of homes built prior to 1940 tend to have a low value of B. For example, neighborhoods that were prestigious prior to 1940, and still carry that prestige today, tend to have a low proportion of black residents.
  • LSTAT is inversely correlated with ZN, RM, and DIS: as the percentage of lower-status population increases, ZN, RM, and DIS drop. For example, the average number of rooms per dwelling (RM) in lower-status neighborhoods is low compared to that in more prestigious neighborhoods.
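The signed relationships described above can be read directly off `df.corr()` rather than eyeballed from the heatmap. A minimal sketch on a small synthetic frame (our own illustration, built to mimic the signs discussed, since `load_boston` is removed in recent scikit-learn versions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 100, n)
# synthetic columns constructed to mimic the signs discussed above
lstat = 5 + 0.10 * age + rng.normal(0, 2, n)   # LSTAT rises with AGE
b = 400 - 0.50 * age + rng.normal(0, 10, n)    # B falls with AGE
toy = pd.DataFrame({'AGE': age, 'LSTAT': lstat, 'B': b})

corr = toy.corr()
for a, c in [('LSTAT', 'AGE'), ('B', 'AGE')]:
    print(f'{a} vs {c}: {corr.loc[a, c]:+.2f}')
```

On the real Boston frame the same `corr.loc[...]` lookups give the exact coefficients behind the heatmap.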

Profile Dataset

In [8]:
profile = ProfileReport(df, title="Boston Features Profile")
profile.to_notebook_iframe()



Step 2:

Randomize 1%, 5%, 10%, 20%, 33%, and 50% missing values in a single column

In [5]:
df_dict = {} # will hold dataframes with various % of missing values
percentages = [0.01, 0.05, 0.1, 0.2, 0.33, 0.5]
column = 'AGE'

for p in percentages:
    in_sample = df.sample(frac=p, random_state=42) # rows selected to go missing
    in_sample[column] = np.NaN # set selected column values to NaN

    out_sample = df.drop(in_sample.index) # remaining rows, untouched

    df_dict[str(int(p * 100))] = pd.concat([out_sample, in_sample]).sort_index()
    
In [6]:
for key in df_dict.keys():
    print(f"df_dict[\"{key}\"]: '{column}' column, in {key}%",
          f"sampled dataframe, has {df_dict[key].isnull().sum()[column]}",
          f"of {df_dict[key].shape[0]} missing values")

df_dict["1"]: 'AGE' column, in 1% sampled dataframe, has 5 of 506 missing values
df_dict["5"]: 'AGE' column, in 5% sampled dataframe, has 25 of 506 missing values
df_dict["10"]: 'AGE' column, in 10% sampled dataframe, has 51 of 506 missing values
df_dict["20"]: 'AGE' column, in 20% sampled dataframe, has 101 of 506 missing values
df_dict["33"]: 'AGE' column, in 33% sampled dataframe, has 167 of 506 missing values
df_dict["50"]: 'AGE' column, in 50% sampled dataframe, has 253 of 506 missing values

Baseline Model w/o Missing Values

In [7]:
print("Baseline (zero NAs):")
reg = LinearRegression(normalize=True, fit_intercept=True)
fit = reg.fit(X_train, y_train) # keep a handle to the fitted baseline model

# R^2 (stored for comparison against the imputed models below)
baseline_score = reg.score(X_train, y_train)
baseline_score_test = reg.score(X_test, y_test)
print("R^2 (Train):", baseline_score)
print("R^2 (Test):", baseline_score_test)
# MSE
pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
Baseline (zero NAs):
R^2 (Train): 0.7196488510369425
R^2 (Test): 0.7522613664970778
MSE: 24.18641019530455

Let's get the fitted values from the model, since scikit-learn's .fit() does not return them directly.

In [117]:
fitted = fit.predict(X_train)

Plot of Fitted Values v Actual (baseline)

In [118]:
fig, ax = plt.subplots(figsize=(20, 10), nrows=1, ncols=1)
p = pd.DataFrame({
    'index': range(len(fitted)),
    'predicted_values': fitted,
    'actual_values': y_train
}).set_index('index').plot(kind='line', ax=ax)

p.lines[0].set_linestyle('--')
p.set_xlabel('Observation')
p.set_ylabel('Median Home Price')
p.set_title('Baseline Fitted vs Actual')
Out[118]:
Text(0.5, 1.0, 'Baseline Fitted vs Actual')

Question 2: Impute data and compare

Compare loss and goodness of fit to your baseline

In [8]:
print('Imputation with median value')
print()

for key in df_dict.keys():
    print(key, '%')
    print(f"Synthetic missing values (at random) for {key}% of '{column}' column.")
    _X_train, _X_test, _y_train, _y_test = train_test_split(df_dict[key].copy(), df_targets.target, test_size=0.33, random_state=7337)
    
    # calculate median AGE for imputation
    median = _X_train[column].median()
    print(f"Imputed with median, {median}")
    
    # impute NAs
    _X_train[column].fillna(median, inplace=True)
    
    # build and fit model
    reg = LinearRegression(normalize=True)
    reg.fit(_X_train, _y_train)
    
    # impute the validation set (test) with the same value
    _X_test[column].fillna(median, inplace=True)
    
    # R^2
    print("R^2:", reg.score(_X_test, _y_test))
    
    # MSE
    pred = reg.predict(_X_test)
    print("MSE:", mean_squared_error(_y_test, pred))
    
    print('-'*20)
    print()
Imputation with median value

1 %
Synthetic missing values (at random) for 1% of 'AGE' column.
Imputed with median, 77.5
R^2: 0.7521675437497414
MSE: 24.1955699917426
--------------------

5 %
Synthetic missing values (at random) for 5% of 'AGE' column.
Imputed with median, 76.80000000000001
R^2: 0.7520298402151848
MSE: 24.209013814068722
--------------------

10 %
Synthetic missing values (at random) for 10% of 'AGE' column.
Imputed with median, 76.7
R^2: 0.7522853769122572
MSE: 24.184066088766635
--------------------

20 %
Synthetic missing values (at random) for 20% of 'AGE' column.
Imputed with median, 76.9
R^2: 0.7522811866708815
MSE: 24.18447517674494
--------------------

33 %
Synthetic missing values (at random) for 33% of 'AGE' column.
Imputed with median, 76.95
R^2: 0.7524070279001452
MSE: 24.172189456316847
--------------------

50 %
Synthetic missing values (at random) for 50% of 'AGE' column.
Imputed with median, 76.0
R^2: 0.7510828821764767
MSE: 24.30146413252769
--------------------

Compared to the baseline, the six models with varying degrees of missing values in the 'AGE' column had nearly identical MSE and $R^2$ (some marginally better, some marginally worse) when imputing the newly introduced NAs with the training-set median for that column. The median was selected over the mean for its robustness to outliers; that said, the mean was also tested with little difference.

Imputation appears to be a successful approach for achieving losses comparable to the baseline model with no missing values in 'AGE'.
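The robustness argument for the median can be illustrated on synthetic data: a handful of extreme values drags the mean far more than the median. (Hypothetical numbers, not the Boston data.)

```python
import numpy as np

rng = np.random.default_rng(7337)
ages = rng.uniform(10, 100, 500)
ages[:10] = 900  # a few hypothetical corrupt/outlier entries

print(f'mean:   {ages.mean():.1f}')       # pulled upward by the outliers
print(f'median: {np.median(ages):.1f}')   # barely moves
```

With clean, roughly symmetric data like AGE here, the two statistics nearly coincide, which is why swapping mean for median changed little.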

Baseline Parameters

Unfortunately, scikit-learn does not report confidence intervals for parameter estimates the way R's summary output does, but we can still see the effect imputation has on the structure of the model by comparing how the parameter estimates change under our imputation scheme.
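One workaround, sketched here on synthetic data, is to bootstrap the coefficients: refit on resampled rows and take percentile intervals. The resampling loop below is our own illustration, not part of the case-study pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, size=300)

boot = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))  # resample rows with replacement
    boot.append(LinearRegression().fit(X[idx], y[idx]).coef_)
boot = np.asarray(boot)

# 95% percentile interval per coefficient
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for i in range(X.shape[1]):
    print(f'coef {i}: 95% bootstrap CI [{lo[i]:+.3f}, {hi[i]:+.3f}]')
```

The same loop applied to each imputed data frame would show whether the parameter shifts we observe below fall inside baseline sampling noise.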

In [111]:
keys = list(X_train.columns)
keys.append('Intercept')
params = list(fit.coef_)
params.append(fit.intercept_)
baseline_params = pd.DataFrame(zip(keys, params)).rename(columns={
    0: 'Field',
    1: 'Baseline Parameter Estimate'
})
baseline_params
Out[111]:
Field Baseline Parameter Estimate
0 CRIM -0.069434
1 ZN 0.037477
2 INDUS 0.030331
3 CHAS 4.153992
4 NOX -19.290708
5 RM 3.651212
6 AGE -0.000488
7 DIS -1.353645
8 RAD 0.249161
9 TAX -0.006062
10 PTRATIO -0.945714
11 B 0.009307
12 LSTAT -0.522654
13 Intercept 35.608734

Step 3:

Take two columns and create data “Missing at Random” when controlled for a third variable (i.e., if Variable Z is > 30, then Variables X, Y are randomly missing). Use your preferred imputation method to fill in 10%, 20% and 30% of your missing data.

The function defined above (take_two) will take any arbitrary combination of columns to be imputed based on a criterion specified for another column. An imputation function is applied to a set of records determined by the criterion and a sample rate. This could be used for a much more exhaustive search over imputation schemes if one were so inclined.

For this test, B and LSTAT will be imputed to the mean wherever AGE is less than 60.

In [119]:
samples = take_two(X_train,
                   null_cols=['B', 'LSTAT'],
                   test_col='AGE',
                   cond='<60',
                   samples=np.linspace(0, 1, 100),
                   impute_func=np.mean)
Percent of imputed values:0.00%
B        355.621091
LSTAT     12.619853
dtype: float64
<function mean at 0x0000020777C94378>
0.0
Percent of imputed values:0.29%
B        355.506183
LSTAT     12.635178
dtype: float64
<function mean at 0x0000020777C94378>
0.010101010101010102

[... output truncated: the same block repeats for each of the 100 sample rates from 0.0 to 1.0; the imputed B mean falls steadily while the LSTAT mean rises ...]

Percent of imputed values:37.76%
B        335.790948
LSTAT     15.541564
dtype: float64
<function mean at 0x0000020777C94378>
1.0
In [78]:
X_train[X_train.AGE < 60].shape[0]/X_train.shape[0]
Out[78]:
0.3775811209439528

Setting a 10/20/30% sample rate for the rows that match our criterion equates to imputing 3.83%/7.67%/11.21% of our values with the target statistic (in this case, the mean). Note that the imputed mean for B declines as the sample rate increases (from roughly 354 to 351 over the 10-30% range), whereas the mean for LSTAT increases (from about 12.8 to 13.3).
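The drift happens because the imputation base shifts: as more AGE < 60 rows are nulled out, the statistic is computed increasingly from the remaining older-housing rows, which have lower B and higher LSTAT on average. A toy sketch of the limiting case (synthetic data, not the Boston columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = rng.uniform(0, 100, 1000)
lstat = 5 + 0.10 * age + rng.normal(0, 2, 1000)  # rises with AGE, as in Boston
toy = pd.DataFrame({'AGE': age, 'LSTAT': lstat})

full_mean = toy['LSTAT'].mean()
# at a 100% sample rate every AGE < 60 row is nulled, so the imputation
# value comes entirely from the AGE >= 60 rows
limit_mean = toy.loc[toy['AGE'] >= 60, 'LSTAT'].mean()
print(f'full-sample mean: {full_mean:.2f}, AGE>=60-only mean: {limit_mean:.2f}')
```

At intermediate sample rates the imputation value interpolates between these two extremes, which is exactly the monotone drift visible in the output above.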

Refitting the LR model to the new data frames (regenerated with the default 10/20/30% sample rates)

In [157]:
for rate, data in samples.items():
    print(f'{rate:.2%} sample rate results:')
    reg = LinearRegression(normalize=True, fit_intercept=True)
    fit = reg.fit(data, y_train)
    train_score = reg.score(X_train, y_train)
    test_score = reg.score(X_test, y_test)
    print(
        f'Train: {train_score:.6f} is {" better than " if train_score > baseline_score else " worse than " }baseline R2'
    )
    print(
        f'Test: {test_score:.6f} is {" better than " if test_score > baseline_score_test else " worse than " }baseline R2'
    )
10.00% sample rate results:
Train: 0.719177 is  worse than baseline R2
Test: 0.754730 is  better than baseline R2
20.00% sample rate results:
Train: 0.717919 is  worse than baseline R2
Test: 0.752540 is  better than baseline R2
30.00% sample rate results:
Train: 0.717504 is  worse than baseline R2
Test: 0.757696 is  better than baseline R2

Question 3:

In each case [10%, 20%, 30%] perform a fit with the imputed data and compare the loss and goodness of fit to your baseline. [Note: you should have (9) models to compare against your baseline at this point.]

Using the mean imputation scheme described above, all three models were slightly better than our baseline model on the testing set; on the training set, they were all slightly worse. The best of the three models (out of sample) was the 30% sample rate model (0.757696 vs 0.752261 $R^2$).

For fun, let's try the same example with the median

In [158]:
samples = take_two(X_train,
                   null_cols=['B', 'LSTAT'],
                   test_col='AGE',
                   cond='<60',
                   impute_func=np.median)
Percent of imputed values:3.83%
[390.945  11.465]
<function median at 0x0000027E46C52C80>
0.1
Percent of imputed values:7.67%
[390.74  11.98]
<function median at 0x0000027E46C52C80>
0.2
Percent of imputed values:11.21%
[390.86  12.33]
<function median at 0x0000027E46C52C80>
0.3
In [159]:
for rate, data in samples.items():
    print(f'{rate:.2%} sample rate results:')
    reg = LinearRegression(normalize=True, fit_intercept=True)
    fit = reg.fit(data, y_train)
    train_score = reg.score(X_train, y_train)
    test_score = reg.score(X_test, y_test)
    print(
        f'Train: {train_score:.6f} is {" better than " if train_score > baseline_score else " worse than " }baseline R2'
    )
    print(
        f'Test: {test_score:.6f} is {" better than " if test_score > baseline_score_test else " worse than " }baseline R2'
    )
10.00% sample rate results:
Train: 0.719102 is  worse than baseline R2
Test: 0.755050 is  better than baseline R2
20.00% sample rate results:
Train: 0.717601 is  worse than baseline R2
Test: 0.754133 is  better than baseline R2
30.00% sample rate results:
Train: 0.715893 is  worse than baseline R2
Test: 0.753947 is  better than baseline R2

For fun, let's try the same example with the median but also change the cutoff from 60 years to 85 years. We wrote the code to be experimental after all!

In [160]:
samples = take_two(X_train,
                   null_cols=['B', 'LSTAT'],
                   test_col='AGE',
                   cond='<85',
                   impute_func=np.median)
Percent of imputed values:5.90%
[390.95  11.45]
<function median at 0x0000027E46C52C80>
0.1
Percent of imputed values:11.50%
[390.925  11.73 ]
<function median at 0x0000027E46C52C80>
0.2
Percent of imputed values:17.40%
[390.565  12.125]
<function median at 0x0000027E46C52C80>
0.3
In [161]:
for rate, data in samples.items():
    print(f'{rate:.2%} sample rate results:')
    reg = LinearRegression(normalize=True, fit_intercept=True)
    fit = reg.fit(data, y_train)
    train_score = reg.score(X_train, y_train)
    test_score = reg.score(X_test, y_test)
    print(
        f'Train: {train_score:.6f} is {"better than" if train_score > baseline_score else "worse than"} baseline R2'
    )
    print(
        f'Test: {test_score:.6f} is {"better than" if test_score > baseline_score_test else "worse than"} baseline R2'
    )
10.00% sample rate results:
Train: 0.719378 is worse than baseline R2
Test: 0.754127 is better than baseline R2
20.00% sample rate results:
Train: 0.717881 is worse than baseline R2
Test: 0.755723 is better than baseline R2
30.00% sample rate results:
Train: 0.715116 is worse than baseline R2
Test: 0.759534 is better than baseline R2

Using the median imputation scheme described above, all three models were slightly better than our baseline model on the testing set; on the training set, they were all slightly worse. The best of the three models (out of sample) was the 30% sample rate model (0.759534 vs. 0.75226 $R^2$). This particular scheme (median, AGE < 85 cutoff) produced the best results in this analysis. Note that it also had the most samples imputed: 17.40% of the data.
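For reference, the column-wise median imputation used here reduces to a couple of pandas calls. A minimal, self-contained sketch on a toy frame (the values below are made up; only the column names B and LSTAT come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the masked training frame (hypothetical values)
df = pd.DataFrame({'B': [390.0, 380.0, np.nan, 396.0],
                   'LSTAT': [5.0, np.nan, 12.0, 20.0]})

# Median of the observed values, column by column (pandas skips NaN)
medians = df[['B', 'LSTAT']].median()

# Fill each column's missing entries with that column's median
df_imputed = df.fillna(medians)
print(df_imputed)
```

Passing a Series to `fillna` fills each column with the value indexed by that column's name, which is why a single call handles both B and LSTAT.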

Step 4:

Create a “Missing Not at Random” pattern in which 25% of the data is missing for a single column.

Let's blank out the top 25% of AGE, since it appears heavily skewed, and then impute the missing values with the mean of the remaining data.

In [177]:
cuts, bins = pd.qcut(X_train['AGE'], q=[0, .25, .5, .75, 1.], labels=['Q1', 'Q2', 'Q3', 'Q4'], retbins=True)
In [179]:
bins
Out[179]:
array([  2.9 ,  42.15,  77.7 ,  94.05, 100.  ])

Cutting out the top 25% of the AGE range leaves houses less than 94.05 years old.
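As a sanity check on the quartile logic, `pd.qcut` with these quantile edges puts the top quarter of the values in Q4. A sketch on synthetic ages (not the actual AGE column):

```python
import numpy as np
import pandas as pd

ages = pd.Series(np.arange(1, 101, dtype=float))  # 100 synthetic ages

# Same quantile edges and labels as used on AGE above
cuts, bins = pd.qcut(ages, q=[0, .25, .5, .75, 1.],
                     labels=['Q1', 'Q2', 'Q3', 'Q4'], retbins=True)

# Q4 holds the top quartile, i.e. roughly 25% of the rows
q4_share = (cuts == 'Q4').mean()
print(q4_share)  # → 0.25
```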

In [181]:
sample = X_train.copy()
remove_index = cuts[cuts=='Q4'].index
sample.loc[remove_index, 'AGE'] = None
sample_mean = sample.loc[sample.index.difference(remove_index), 'AGE'].mean()
sample.loc[remove_index, 'AGE'] = sample_mean
In [183]:
sample['AGE'].describe()
Out[183]:
count    339.000000
mean      57.982283
std       23.205433
min        2.900000
25%       42.150000
50%       57.982283
75%       77.500000
max       94.000000
Name: AGE, dtype: float64

Question 4:

Perform a fit with the imputed data [25%] and compare the loss and goodness of fit to your baseline. [Note: you should have (10) models to compare against your baseline at this point.]

In [186]:
reg = LinearRegression(normalize=True, fit_intercept=True)
fit = reg.fit(sample, y_train)
train_score = reg.score(X_train, y_train)
test_score = reg.score(X_test, y_test)
print(
    f'Train: {train_score:.6f} is {"better than" if train_score > baseline_score else "worse than"} baseline R2'
)
print(
    f'Test: {test_score:.6f} is {"better than" if test_score > baseline_score_test else "worse than"} baseline R2'
)
Train: 0.716569 is worse than baseline R2
Test: 0.757745 is better than baseline R2

Step 5:

Describe your imputation approach and summarize your findings. What impact did the missing data have on your baseline model’s performance?

Baseline Score = 0.75226

In Step 3, we decided to impute the variables "B" and "LSTAT" with the mean wherever "AGE" is less than 60. The take_two function accepts any arbitrary combination of columns to be imputed based on a criterion specified for another column: a random sample (drawn at the given sample rate) of the records meeting the criterion is set to missing, and the specified imputation function, computed over the remaining records, fills those values. Using the mean imputation scheme described above, all three models (in Step 3) were slightly better than our baseline model on the testing set; on the training set, they were all slightly worse. The best of the three models (out of sample) was the 30% sample rate model (0.75696 vs. 0.75226 $R^2$).
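The core of take_two (conditional random masking followed by mean imputation from the surviving rows) can be sketched on synthetic data; the values and seeds below are illustrative, only the column names come from the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'AGE': rng.uniform(2, 100, 200),
                   'B': rng.uniform(0, 400, 200),
                   'LSTAT': rng.uniform(1, 38, 200)})

# Randomly mask 30% of the rows meeting the criterion (AGE < 60)
mask_idx = df.query('AGE < 60').sample(frac=0.3, random_state=0).index
df.loc[mask_idx, ['B', 'LSTAT']] = np.nan

# Impute each column with the mean of the rows that were NOT masked
means = df.loc[df.index.difference(mask_idx), ['B', 'LSTAT']].mean()
df[['B', 'LSTAT']] = df[['B', 'LSTAT']].fillna(means)
print(df[['B', 'LSTAT']].isna().sum().sum())  # → 0
```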

Given the definitions of these variables (chosen specifically for their social context - see data dictionary), the simplest interpretation of these trends is that as we impute more values for homes with ages less than 60, the concentration of African Americans decreases; this implies that older homes (>60 years) are located in areas with lower concentrations of African Americans. Conversely, the concentration of those considered "lower-class" increased as we imputed values for homes with ages less than 60. Again, this points to a higher concentration of lower-class individuals in areas with homes over 60 years in age.

After experimenting with the mean imputation scheme, we tried the same scheme using the median at the [10%, 20%, 30%] sample rates. We found that all three models were slightly better than our baseline model on the testing set; on the training set, they were all slightly worse. The best of the three models (out of sample) was the 30% sample rate model (0.759534 vs. 0.75226 $R^2$). This particular scheme (median, AGE < 85 cutoff) produced the best results in this analysis. Note that it also had the most samples imputed: 17.40% of the data.

In Step 4, we cut off the top 25% of AGE, since it appears heavily skewed, and imputed the missing values with the mean of the remaining data. Cutting out the top 25% of the AGE range leaves houses less than 94.05 years old.
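The Step 4 MNAR procedure can be wrapped as a small helper; this is a sketch on synthetic data, and the function name is ours, not part of the analysis code:

```python
import numpy as np
import pandas as pd

def impute_top_quartile_mnar(df, col):
    """Blank the top 25% of `col` (Missing Not at Random) and
    refill those entries with the mean of the remaining 75%."""
    out = df.copy()
    cuts = pd.qcut(out[col], q=[0, .25, .5, .75, 1.],
                   labels=['Q1', 'Q2', 'Q3', 'Q4'])
    remove_index = cuts[cuts == 'Q4'].index
    out.loc[remove_index, col] = np.nan
    fill = out[col].mean()            # mean of the surviving 75%
    out.loc[remove_index, col] = fill
    return out, fill

# Usage on a synthetic AGE-like column (ages 1..100)
demo = pd.DataFrame({'AGE': np.arange(1.0, 101.0)})
demo_imputed, fill = impute_top_quartile_mnar(demo, 'AGE')
print(fill)  # → 38.0, the mean of the bottom three quartiles
```

Because the missingness depends on the value of AGE itself, this is MNAR by construction, and the fill value is biased low relative to the true mean of the column.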
